MTG-Link: leveraging barcode information from linked-reads to assemble specific loci

Background Local assembly with short and long reads has proven to be very useful in many applications: reconstruction of the sequence of a locus of interest, gap-filling in draft assemblies, as well as alternative allele reconstruction of large Structural Variants. Whereas linked-read technologies have a great potential to assemble specific loci as they provide long-range information while maintaining the power and accuracy of short-read sequencing, there is a lack of local assembly tools for linked-read data. Results We present MTG-Link, a novel local assembly tool dedicated to linked-reads. The originality of the method lies in its read subsampling step which takes advantage of the barcode information contained in linked-reads mapped in flanking regions. We validated our approach on several datasets from different linked-read technologies. We show that MTG-Link is able to assemble successfully large sequences, up to dozens of Kb. We also demonstrate that the read subsampling step of MTG-Link considerably improves the local assembly of specific loci compared to other existing short-read local assembly tools. Furthermore, MTG-Link was able to fully characterize large insertion variants and deletion breakpoints in a human genome and to reconstruct dark regions in clinically-relevant human genes. It also improved the contiguity of a 1.3 Mb locus of biological interest in several individual genomes of the mimetic butterfly Heliconius numata. Conclusions MTG-Link is an efficient local assembly tool designed for different linked-read sequencing technologies. MTG-Link source code is available at https://github.com/anne-gcd/MTG-Link and as a Bioconda package. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05395-w.


Linked-read datasets
MTG-Link was used on various linked-read datasets, obtained from two different linked-read technologies: • stLFR (Homo sapiens): sequencing of the HG002 human individual with the stLFR technology.
The BAM/FASTQ files are provided by the Genome In A Bottle consortium and were downloaded from the following link: https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/ HG002_NA24385_son/stLFR/. We considered the assembly GRCh37 (hg19 version) as the human reference genome.
• 10x Genomics (Heliconius numata): sequencing with the 10x Genomics Chromium technology of 12 individuals genomes of the butterfly Heliconius numata. The datasets of each individual are available under the SRA Project ID PRJNA676017.
For the stLFR dataset, the BAM and FASTQ files were pre-processed as indicated in the LRez publication, so that the barcodes sequences are reported using the BX:Z tag in the alignment tags of BAM files and in the reads' headers of FASTQ files. The 10x Genomics dataset was already formatted as such.
Besides, for the stLFR dataset, the raw input BAM file contains PCR duplicates that are not removed but only marked with Picard. Thus, these PCR duplicates were removed using Samtools and a new FASTQ file was generated from the BAM file using samtools bam2fq.

Local assembly tools command lines and parameters
For each local assembly tool tested, we provide the command lines that were used below.
Running MTG-Link: MTG-Link was run with the default read subsampling parameters values: -flank 10000 -occ 2 ; and with the default following assembly parameters values: -ext 500 -a 3 2. For large target sizes (10 and 20 Kb), the -l parameter (maxLength) was set to 50000. Otherwise, it was set to the default parameter value (-l 10000). mtglink.py DBG -gfa gfaFile.gfa -bam bamFile.bam -fastq readsFile.fastq.gz \ -index barcodeIndex.bci -t 4 -k 61 51 41 31 21 -nb-cores 8 Running MindTheGap: MindTheGap was run with two k-mer sizes: the default k-mer value (31 bp) and also the k-mer size giving the best results with MTG-Link on our datasets (51 bp). In the publication, we present only the results obtained with a k-mer size of 51 bp, as we obtained better assembly results with this k-mer value. Besides, MindTheGap was run in the same conditions as MTG-Link, e.g. with -max-nodes 1000. For large target sizes (10 and 20 Kb), the -max-length parameter was set to 50000. Otherwise, it was set to the default parameter value (-max-length 10000). The -bkpt parameter corresponds to a Fasta file containing the flanking k-mer sequences for each target.

Identification of inter-scaffold gaps in the H. numata Supergene P locus
For the inter-scaffold gap-filling application on the Supergene P locus of the butterfly Heliconius numata, for each individual, the large scaffolds identified as belonging to the locus are first ordered using the number of their common barcodes calculated by LRez compare. To do this, a file containing regions of interest in format chromosome:startPosition-endPosition is required (regionsFile.lst). A matrix file containing the number of common barcodes between all possibles pairs of scaffolds' extremities is thus obtained. Then, this file is converted to a GFA file using the script provided in the utils/ folder of the MTG-Link GitHub repository.     Table S4: Results of MTG-Link on the inter-scaffold gap-filling of the Supergene P locus in eight H. numata individuals. The linked-read datasets have different characteristics between individuals (average read depth and N50 of the whole draft assembly). The ratio of gaps filled by MTG-Link corresponds to the number of gaps successfully filled by MTG-Link over the number of initial gaps identified from the scaffold re-ordering step using the number of their common barcodes. Are also reported in this table the number of scaffolds in the locus before and after the local assembly with MTG-Link, as well as the total number of base-pair (bp) assembled for each individual.